3DS : Cherry Blossom 10-Mile Race

Background:

The Credit Union Cherry Blossom (CUCB) is a non-profit organization that manages an annual 10-mile race in Washington, D.C. The race takes place on the first Sunday of April, and entrants are admitted through a lottery system implemented in 1980. The race grew rapidly, with 1977 being the first year in which the number of runners had to be limited. Today the field is a diverse mix of professional and amateur runners of many ages. Using data collected from the CUCB website, we explore the relationship between age and fitness.

Data Extraction Method:

We created a Jupyter notebook to scrape the data from the webpage: <https://www.cballtimeresults.org/performance-search/?eventType=10M&year=1973&division=M&page=1>. The Python library “requests” was used to connect to the URL, and “BeautifulSoup” was used to parse the HTML and extract the data. We wrote a function that iterated through the specified variables (section, division, year, and page) to ensure all the data was collected. The CUCB website contains data for the following columns: Name, PiD/TiD, PiS/TiS, Age, Time, Pace, Division, and Home Town. The data was scraped for men and women between 1973 and 2019. 1973 was the first year the event was held, and 2019 was the last year before organizers canceled the event for the first time, due to COVID-19.
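The iteration over year, division, and page can be sketched as follows. This is only an illustration, written in R to match the rest of this report rather than the Python notebook we actually used. The base URL and query parameters are copied from the address above; the "W" division code for women and the number of pages are assumptions.

```r
# Sketch only: build the result-page URLs for every year/division/page combination.
# "W" as the women's division code and the page range are assumptions.
base_url <- "https://www.cballtimeresults.org/performance-search/"

build_urls <- function(years, divisions, pages) {
  # expand.grid varies its first argument fastest, so pages cycle within each
  # division, and divisions cycle within each year
  grid <- expand.grid(page = pages, division = divisions, year = years,
                      stringsAsFactors = FALSE)
  sprintf("%s?eventType=10M&year=%d&division=%s&page=%d",
          base_url, grid$year, grid$division, grid$page)
}

urls <- build_urls(years = 1973:2019, divisions = c("M", "W"), pages = 1:2)
head(urls, 3)
```

A real scraper would keep requesting pages for a given year and division until a page comes back empty, instead of fixing the page range in advance.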

We incorporated weather data from the National Oceanic and Atmospheric Administration (NOAA) and the National Centers for Environmental Information. The closest weather station to the race was Washington Reagan Airport, Arlington, VA, situated on the banks of the Potomac across from the National Mall, where most of the race takes place. From that data set, we added precipitation and minimum and maximum temperatures for each day the race took place. Each event date was manually recorded from the Rite of Spring pdf, which details the history of the race, and was used to join only the weather data relevant to each event date.
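A minimal sketch of that join is shown below. The column names Date, PRCP, TMIN, and TMAX match our variables, but every value in the toy rows is invented for illustration and the dates are placeholders.

```r
# Toy rows only: the values below are invented to show the join, not real results.
results <- data.frame(
  Date = c("1973-04-01", "1973-04-01", "2000-04-02"),
  Time_in_seconds = c(3100, 3650, 3300)
)
weather <- data.frame(
  Date = c("1973-03-31", "1973-04-01", "2000-04-02"),
  PRCP = c(0.2, 0.0, 0.1),
  TMIN = c(40, 45, 38),
  TMAX = c(60, 66, 52)
)

# all.x = TRUE keeps every race result even if a weather row were missing;
# weather rows for non-race days (1973-03-31 here) simply drop out.
joined <- merge(results, weather, by = "Date", all.x = TRUE)
joined
```

Each runner's row ends up carrying the weather of their race day, which is what lets weather enter a per-runner model later on.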

Describing our Data

Dataset Overview:

The original data set has 347,402 rows and 17 columns. Eight of the variables were excluded due to their lack of relevance. After cleaning the data we ended up with a data frame of 339,934 rows and 9 variables; 7,468 rows were omitted due to missing values for the time and/or age variables. Table 1 and table 2 show all of the variables we considered and worked with.

Table 1: Variables

Variable Names | Data Types | Variable Descriptions | How does this variable contribute to project objectives?
Year | Integer | Year the race was held. | The data spans several years, which lets us see how the times of people who ran the race in multiple years changed.
Date | Character | Date on which the race was held, in year-month-day format. Example: 1973-04-01 | Used as a key to join the weather data.
Name | Character | An individual’s first and last name, in varying formats. Most names on the CUCB website also list an ‘M’, ‘F’, or ‘W’ in parentheses for the individual’s sex. Example: James Yenckel (M) | Used as a personal identifier for each runner. We plan to use this variable to find runners who ran several times and see how their run times changed. This will be done in our final project, not necessarily in this data visualization.
Age | Integer | Age of runner at time of race. | One of our main variables, used to see how performance differs across age ranges and whether younger people have better times.
Time | Time/Numeric | How long each runner took to complete the 10 miles, in hr:min:sec format. | One of our main variables, used as a metric of fitness. We are looking at the relationship between age and time.
Division | Character | An alphanumeric code separating competitors by sex and age. There are 28 divisions, 14 for each sex; 4 of them span 20-year age ranges and the rest span 5-year ranges. Example: W2529 (25- to 29-year-old women) | Used to break the data into age groups, which we then use to draw conclusions about our question.
pos_by_division | Integer | Position in which a runner finished within their assigned division in a given year. | Excluded; see table 2 for the reason.
total_by_division | Integer | Total number of individuals in each division in a given year. | Excluded; see table 2 for the reason.
pos_by_sex | Integer | Position in which a runner finished among runners of the same sex in a given year. | Excluded; see table 2 for the reason.
total_by_sex | Integer | Total number of competitors of each sex in a given year. | Excluded; see table 2 for the reason.
Sex | Character | Gender of runner. | An important variable, since it allows us to compare the times of the two sexes separately instead of only viewing them together.
Hometown | Character | Hometown of runner. | Excluded; see table 2 for the reason.
PRCP | Numeric | Daily rainfall in inches, to one decimal place, collected by NOAA. | Does not contribute directly to our project objectives, but we want to evaluate it, along with the other weather data, as another predictor of running times.
TMIN | Integer | Minimum daily temperature in Fahrenheit, collected by NOAA. | Does not contribute directly to our project objectives, but we want to evaluate it, along with the other weather data, as another predictor of running times.
TMAX | Integer | Maximum daily temperature in Fahrenheit, collected by NOAA. | Does not contribute directly to our project objectives, but we want to evaluate it, along with the other weather data, as another predictor of running times.

Table 2: Variables and Data Excluded from Analysis

What was excluded/modified | Reason for exclusion/modification
Hometown | Many missing values and inconsistencies were found in the entries. A few individuals reported their hometown differently each time they ran the race, or reported several at once, so no accurate conclusions could be drawn from it. It is also not a variable we could use to fulfill our main objective, so we excluded it from our analysis.
Date | Excluded because it was only useful as a key for joining the weather data.
pos_by_sex | The position in which a runner finished among runners of the same sex in a given year. Not used in our data visualization because it was not needed to meet our goal.
total_by_sex | The total number of runners of each sex in a given year. Not included in our analysis because it was not needed to visualize our main goal.
pos_by_division | The position in which a runner finished within their assigned division in a given year. Not used in our data visualization because it was not needed to draw conclusions about our main goal.
total_by_division | The total number of individuals in each division in a given year. The divisions are the same as described above, and the variable is excluded for the same reason.
Pace | The pace per mile of each runner for the race. Excluded because it is redundant: it is the time divided by 10.
Data from the year of 1973 | After we cleaned the data, only 2 entries remained that had both an age and a time recorded, so we excluded the year from our analysis. The history of the event indicates that record keeping was less reliable in the earlier years, and we decided to avoid this year entirely.
Data from the year of 1977 | A large subset of times was missing right in the middle of the range of race times. We are not given information about what led to the missing times, but because of the large volume of missing values we omitted the year entirely as biased data.
Data from the year of 2015 (not yet modified) | The 2015 race was only 9.39 miles long. Because the data was otherwise good, we decided to rescale the race times so that they are comparable to those of a 10-mile race. We did not have time to do this for this data visualization project, but plan to do it for our final. We plan to make this modification by taking each runner’s 2015 finish time, dividing it by 9.39 to obtain their pace per mile, and multiplying that by 10 to estimate their time had they kept that pace for 10 miles. The adjusted times will not be the runners’ actual times; they are estimates of a 10-mile time intended to remove the bias. Had the race been the full 10 miles, the times would likely have differed slightly from our estimates, but we expect that difference to be negligible for this analysis. We therefore decided this was a better approach than removing the year’s data entirely.
Data from the year of 2019 (not yet modified) | The 2019 race was 80 yards short of 10 miles, so we decided to estimate 10-mile times for the same reasons as above. We ran out of time to do this for this data visualization project, but plan to do it for our final. We plan to make this modification by finding each runner’s average pace per mile in that year and multiplying it by 10, giving an estimated 10-mile time for each runner. The same limitations apply to these adjusted times as above. Our group decided it was better to modify this data than to remove it.
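The planned adjustment for the short 2015 and 2019 courses amounts to one line of arithmetic. A minimal sketch, assuming each runner would have held the same average pace over the missing distance (the function name and the example finish time are hypothetical):

```r
# Rescale a finish time recorded on a short course to a 10-mile equivalent,
# assuming the runner would have held the same average pace.
to_ten_miles <- function(time_sec, course_miles) {
  pace_per_mile <- time_sec / course_miles
  pace_per_mile * 10
}

short_2015 <- 9.39               # miles, as reported for 2015
short_2019 <- 10 - 80 / 1760     # 80 yards short; 1,760 yards per mile

# A hypothetical 1:30:00 finish (5,400 s) on each short course
to_ten_miles(5400, short_2015)
to_ten_miles(5400, short_2019)
```

For this example the 2015 adjustment adds nearly six minutes (to roughly 1:35:51), while the 2019 adjustment adds under half a minute (to roughly 1:30:25), which is why the 2015 correction matters much more.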

Challenges and Nuances in Our Data

Nuances

We faced a few challenges while working with this data and also discovered some nuances. Reading the history of the race year by year, we found information that pointed to potential biases. The issue that stood out most was that the races in 2015 and 2019 were not a full 10 miles long. We did not have time to address this in our visualizations, but we plan to address it in our final report. We excluded 7,468 rows from our data set because of faulty records (a missing time or age), as well as the years 1973 (which had only 2 usable data points) and 1977 (which had a chunk of data missing from the middle of the running times).

There were a few nuances we discovered that could not be addressed. These were mostly around the weather on the day of the race and course re-routings. In around 5 of the race years there were wind gusts of 20-50 mph during the race, which worked to the runners’ disadvantage and slowed them down. One year had favorable wind gusts, so for that year we would expect slightly faster running times due to the wind push. In one year it snowed during the race, slowing runners to almost a walk in some places due to the cold. In some years there were even large puddles on parts of the course. The course changed several times over the years, but it kept its length (except for the two short years described above).

Loading the Data and Data From Elsewhere/ Ignoring Data

As mentioned above, we loaded data from NOAA to provide general weather statistics for the day of each race. This data does not directly bear on our question, but we are interested in analysing weather as a potential predictor that explains changes in running times. Some of the data we loaded from the Cherry Blossom Race website was ignored; those details can be found in table 2.

Challenges with the Data and Cleaning the Data

The data presented several challenges that we addressed to make the analysis and results more accurate.

The sex column was created by extracting the letter in parentheses at the end of each name entry. That still left many participants with missing values, so we filled them in from the first character of the division code (every code starts with M or W). This gave us a more complete sex variable to use as a factor.

Because the data was scraped as text and saved to csv, the time variable was loaded as characters and converted to numeric times using the chron library. We noticed a pattern where many race times were recorded as 40 to 59 hours. We assumed these were intended as 40 to 59 minutes, which is plausible in the context of the dataset, and converted them accordingly.

Our data is unusual in that hard limits bound the time and age variables. The oldest competitor to finish the race with a recorded time was 87. There is no minimum age required to compete, but we wanted a reasonable minimum to filter out ages that we perceived as erroneous, so we kept ages from 8 to 87. For race times, we took the world record for 10 miles to be 43 minutes, so there cannot be any valid times below that, and competitors did not have their times recorded if they exceeded 2 hours 20 minutes. This allowed us to filter out erroneous times outside those two values and spared us additional tests for outlying data.

The data collection process has improved over the years: it was initially based on hand-recorded times, whereas modern results are based on GPS tracking and digital recording. The data is also challenging to work with because of its size. We have over 300,000 rows, and the initial scraping process took 97 minutes to run.
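The three cleaning rules described above (filling in sex, reinterpreting 40 to 59 "hours", and filtering by the age and time bounds) can be sketched in base R. This is an illustration on invented rows, not our actual cleaning script, which used the chron library; in particular, reading an entry such as 52:10:00 as 52 minutes 10 seconds is an assumption about how the fields shift.

```r
# Toy rows only: names, divisions and times are invented for illustration.
raw <- data.frame(
  Name     = c("Runner One (M)", "Runner Two", "Runner Three (W)", "Runner Four (F)"),
  Division = c("M3034", "W2529", "W4044", "W2024"),
  Age      = c(31, 27, 5, 22),
  Time     = c("1:12:30", "52:10:00", "1:45:00", "2:45:00"),
  stringsAsFactors = FALSE
)

# 1. Sex: take the letter in the trailing parentheses, else fall back to the
#    first character of Division; treat "F" and "W" as the same label.
tag_pattern <- ".*\\(([MFW])\\)[[:space:]]*$"
has_tag   <- grepl(tag_pattern, raw$Name)
from_name <- ifelse(has_tag, sub(tag_pattern, "\\1", raw$Name), NA)
sex       <- ifelse(is.na(from_name), substr(raw$Division, 1, 1), from_name)
raw$Sex   <- ifelse(sex == "M", "M", "F")

# 2. Time: convert h:m:s to seconds, reading 40-59 "hours" as minutes.
parts <- lapply(strsplit(raw$Time, ":"), as.numeric)
raw$Time_in_seconds <- sapply(parts, function(p) {
  if (p[1] >= 40 && p[1] <= 59) p[1] * 60 + p[2] else p[1] * 3600 + p[2] * 60 + p[3]
})

# 3. Keep only rows inside the stated bounds (ages 8-87, 43:00 to 2:20:00).
clean <- subset(raw, Age >= 8 & Age <= 87 &
                     Time_in_seconds >= 43 * 60 & Time_in_seconds <= 140 * 60)
clean[, c("Name", "Sex", "Time_in_seconds")]
```

In this toy set the 5-year-old and the 2:45:00 finisher are filtered out, and the runner with no sex tag inherits "F" from the W2529 division code.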

Summary Statistics:

Age, Time and Gender Statistics:

Table 3: Summary Statistics Overall
Age Time
Min. : 8.0 Min. :00:43:20
1st Qu.:29.0 1st Qu.:01:19:35
Median :35.0 Median :01:30:50
Mean :36.6 Mean :01:31:25
3rd Qu.:43.0 3rd Qu.:01:42:22
Max. :87.0 Max. :02:20:00

From table 3, contestants range in age from 8 to 87 years old. The mean age of a runner is about 36.6 years, and 50% of all runners are between 29 and 43 years old.

The fastest time to finish a race was 43 minutes 20 seconds, while the slowest recorded was 2 hours and 20 minutes. The latter is no surprise, since it is the slowest time that is allowed to be recorded. The mean time to run the race is 1 hour, 31 minutes, and 25 seconds, and 50% of the runners finished between 01:19:35 and 01:42:22.

Table 4: Summary Statistics for Females
Age Time
Min. : 8.00 Min. :00:48:35
1st Qu.:27.00 1st Qu.:01:28:00
Median :32.00 Median :01:37:31
Mean :34.58 Mean :01:38:11
3rd Qu.:40.00 3rd Qu.:01:47:46
Max. :87.00 Max. :02:20:00

From table 4, the average age of female runners is about 34.6 years, and 50% of female runners are between the ages of 27 and 40. The minimum age is 8 and the maximum is 87. The fastest time a woman ran for this race is 48 minutes and 35 seconds. The average time to finish the race for females is 1 hour 38 minutes and 11 seconds, which is about 7 minutes longer than the overall average. 50% of female runners finished the race between 01:28:00 and 01:47:46.

Table 5: Summary Statistics for Males
Age Time
Min. : 8.00 Min. :00:43:20
1st Qu.:30.00 1st Qu.:01:13:54
Median :37.00 Median :01:23:43
Mean :38.55 Mean :01:24:54
3rd Qu.:45.00 3rd Qu.:01:34:41
Max. :87.00 Max. :02:19:58

From table 5, the average age of male runners is about 38.6 years, roughly 4 years higher than that of females. 50% of the male runners are between the ages of 30 and 45, an older range than that of females, and the youngest and oldest males to run this race are 8 and 87, just like females. The fastest time to finish a race for males was 43 minutes and 20 seconds. The average time to finish a race for men is 01:24:54, which is roughly 13 minutes faster than for women. 50% of males finish between 01:13:54 and 01:34:41.

From figure 1 we can see a trend of running times increasing along with the age groups. Figure 1 directly establishes the relationship between age and time that we are analyzing.

From figure 2, more males than females ran the race overall: 172,459 males and 166,755 females in total.

Looking at figure 2, we can also see that in the early years there were more male than female runners. Around 2009 that switched, and more females than males now run the race each year.

This statistic does not directly provide the answers we are looking for in our goal, but it is informative about our data.

Weather Statistics

Table 6: Weather Statistics for Rain and Temperatures
TMIN TMAX
Min. :32.00 Min. :44.0
1st Qu.:39.00 1st Qu.:56.0
Median :43.00 Median :64.0
Mean :43.11 Mean :63.3
3rd Qu.:47.00 3rd Qu.:70.0
Max. :58.00 Max. :84.0

Table 7: Did it Rain the Day of the Race?
Rain No_Rain
19 26

Examining the race-day weather in table 6, the average daily low was about 43 degrees F and the average daily high was about 63 degrees F. Across all race years, daily minimums ranged from 32 degrees F to 58 degrees F, and daily maximums ranged from 44 degrees F to 84 degrees F.

From table 7, during 26 of the race days, there was no rain, while 19 of the race days had rain.

This does not help directly answer our question, but it does give us insight into weather distributions for the race.

Exploring Data

Age Density Distribution Plots

Figure 3 is a density plot (that overlays a histogram) of the age distribution for all of the years. The most common runner age peaks at 28 years old, and most runners are between the ages of 20 and 60. The density spikes in the 20-30 range and then falls as age increases. This plot helps us visualize the overall distribution of ages and allows us to begin to evaluate the assumptions for further regression analysis.

Figure 4 is a density plot for age separated by year. We can see the general peak of ages and how it differs from year to year. In 1993 we see the highest concentration of older runners, while in 1974 we see a peak of younger runners. The year 1974 also looks the most non-normal and asymmetrical, probably due to the smaller number of runners that year. We see the same trend as in the overall graph, with runners concentrated at ages 25-30, and the most common peak age again looks to be roughly 28 years old. This plot doesn’t directly answer our question, but it gives more context on the age distribution of runners in the race.

Time Density Distribution Curve

Figure 5 is a density plot (which overlays a histogram) of times. The distribution is approximately normal, with its peak centered near the average race time of roughly 1 hour and 31 minutes. The majority of times fall between 1 hour 6 minutes and 1 hour 55 minutes. This doesn’t really answer our question about age and run times, but it helps us see where the general distribution of the data lies.


Figure 6 shows the density of the time distribution by year. With each year of the race, the concentration of finishing times shifted toward longer times. In the early years, most runners finished in shorter times than in more recent years. The year with the highest concentration of fast finishers is 1978, when the peak finishing time was below 01:07:30, which is really fast. The more recent average finishing time is centered above 01:31:40. This is no surprise, because the race started among elite runners and then transitioned into something that the general public has access to.

In figure 7 we can see that the time to run the race increases with age, although not by much. This is very useful for our analysis, since we are trying to determine whether the fitness of individuals goes down with age. The trend lines of the scatter plots show a trend of decreasing fitness with age.

In figure 8 we separate the two sexes, male and female, and look at the overall time and age of each participant over the years. Both sexes show the same pattern: time increases, and so performance decreases, as the age of the participants increases.

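library(plotly)  # plot_ly(), layout(), and the %>% pipe

# Read the streakers table: an Age column plus one column of times per runner
streakdf <- read.csv("cb_streakers2.csv", check.names = FALSE)

# Reshape to long format (one row per runner and age) and drop empty cells
streakdf_long <- reshape2::melt(streakdf, id.vars = "Age", variable.name = "Name")
streakdf_long$value[streakdf_long$value == ""] <- NA
streakdf_long <- na.omit(streakdf_long)

# One color per runner
colors <- rainbow(20)

# One line per runner: finish time against age
plot_ly(streakdf_long, x = ~Age, y = ~value, color = ~Name, type = 'scatter', mode = 'lines',
         colors = colors) %>%
  layout(title = "Individual Performance \n Most Consecutive Races",
         xaxis = list(title = "Age"),
         yaxis = list(title = "Time",
                      dtick = 10, #tick space 10mins
                      tickformat = "%H:%M"))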

Model Fitting

We would also like to fit a model to see how age, gender, and weather conditions relate to the performance of runners in the CUCB race. We found that the best indicator of performance in the race is gender, followed by age and weather conditions.

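\[ Performance(Time) = \beta_0 + \beta_1 Sex + \beta_2 Age + \beta_3 Precipitation + \beta_4 Max Temps + \beta_5 Min Temps \]

The coefficient of determination is \(R^2 = 20.25\%\). \(R^2\) is the proportion of the variability in performance that the variables above explain. So around \(80\%\) of the variability is left unexplained and comes from factors that may or may not be recorded in the race data. This could be partly because the data is not normally distributed or has other errors in it that don’t work well with linear models.

The fitted model takes the form:

\[ Performance(Time) = 5044.28 - 834.94 Sex + 16.81 Age - 559.14 Precipitation + 9.82 Max Temps - 7.92 Min Temps \]

With time measured in seconds, a positive coefficient corresponds to a slower finish and a negative coefficient to a faster one. The Sex coefficient matches the summary tables if female is the baseline category: male runners finish about 835 seconds (roughly 14 minutes) faster, so being female is associated with slower times. Each additional year of age adds about 17 seconds, and a higher maximum temperature is also associated with slower times. The precipitation and minimum temperature coefficients are negative, which corresponds to faster rather than slower times, so the weather effects should be interpreted with caution.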
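To make the meaning of \(R^2\) and of the Sex coefficient concrete, here is a minimal sketch on simulated data. The coefficients and noise level are invented so that the fit has a low \(R^2\), loosely like the model above; none of the numbers are our results, and the weather terms are left out for brevity.

```r
# Simulated data only: coefficients and noise level are invented.
set.seed(1)
n   <- 2000
sim <- data.frame(
  Age = sample(18:70, n, replace = TRUE),
  Sex = factor(sample(c("F", "M"), n, replace = TRUE))
)
sim$Time_in_seconds <- 5000 + 17 * sim$Age - 800 * (sim$Sex == "M") +
  rnorm(n, sd = 900)

fit <- lm(Time_in_seconds ~ Sex + Age, data = sim)

# R^2 = 1 - RSS / TSS: the share of variance in finish time the model explains
rss <- sum(residuals(fit)^2)
tss <- sum((sim$Time_in_seconds - mean(sim$Time_in_seconds))^2)
r2_by_hand <- 1 - rss / tss

c(by_hand = r2_by_hand, from_summary = summary(fit)$r.squared)
coef(fit)   # "SexM" is the male-minus-female difference; "F" is the reference level
```

Because R orders factor levels alphabetically, "F" becomes the reference level and the reported coefficient is for "M"; a negative value then means men finish in fewer seconds.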

Assumptions

To fit a linear model we first need to make a few assumptions:

  • The relationship between \(Y\) and the \(\beta\)’s is linear.

  • The errors are normally distributed. We can check this with the normal Quantile-Quantile plot in figure 9.

  • The errors are uncorrelated. Figure 10 shows that this is not the case.

  • The errors have constant variance and zero mean.
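The normality and constant-variance checks listed above can be sketched with base R graphics. The data here is simulated, with noise that deliberately grows with age so that the checks have something to show; it is not our race data, and the variable names are placeholders.

```r
# Simulated data only, with noise that grows with age.
set.seed(2)
age  <- sample(18:70, 1000, replace = TRUE)
time <- 4500 + 17 * age + rnorm(1000, sd = 10 * age)
fit  <- lm(time ~ age)
res  <- residuals(fit)

pdf(NULL)  # discard graphics output; drop this line to see the plots
qqnorm(res); qqline(res)     # normality: points should follow the line
plot(fitted(fit), res)       # constant variance: look for a funnel shape
dev.off()

# Numeric counterparts: residual mean, and spread in the younger vs older half
mean(res)
tapply(res, age > median(age), sd)
```

A residual mean of essentially zero is guaranteed by least squares with an intercept, so the informative checks are the shape of the Q-Q plot and whether the residual spread differs between groups.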

Further Analysis of Linear Models

For a better fit we want to look for variance-stabilizing transformations to correct the non-constant variance of the errors. This would address the constant-variance assumption of linear models. After the fix, we would expect more reliable model results, which could give us insight into the relationship between age and runner times/fitness.
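One common variance-stabilizing transformation is the logarithm, which helps when the spread of the response grows in proportion to its level. The sketch below uses simulated data with multiplicative noise and an exaggerated age effect so the pattern is easy to see; whether a log transform suits our race times is something we still need to check.

```r
# Simulated data only: noise is multiplicative and the age effect is exaggerated.
set.seed(5)
age  <- sample(18:70, 2000, replace = TRUE)
time <- 3000 * exp(0.02 * age + rnorm(2000, sd = 0.15))

fit_raw <- lm(time ~ age)
fit_log <- lm(log(time) ~ age)

# Ratio of residual spread, older half vs younger half; near 1 means constant variance
spread_ratio <- function(fit) {
  older <- age > median(age)
  sd(residuals(fit)[older]) / sd(residuals(fit)[!older])
}

c(raw = spread_ratio(fit_raw), log = spread_ratio(fit_log))
```

On the raw scale the older half shows a clearly larger residual spread, while on the log scale the two halves are close to equal.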

  1. Building a linear model with all the data did not give us any conclusive results. Instead, we could take a random sample in which the ages or divisions are equally represented and look at performance/physical activity. The steps would be: divide the data by division (see lab activity 3); based on the size of the smallest division, randomly pick that many participants from every other division; then build a model and check for linear association.

  2. Another experiment is to compare, within a particular year, longtime runners of the CUCB 10-mile race with first-time runners. Is there a difference in activity/performance based on age group? We will need to extract new data to get the names of longtime runners. We would pick the years in which no disturbances or external factors affected the race, pick out the 2 groups of runners, and look for any interesting relationships between them.

  3. Effect of external factors on the overall performance of the runners. Do we see that days with rain, snow, or flooding increased the overall average run time?
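The balanced sampling described in the first item can be sketched in base R. The division labels follow the W2529-style codes from our data, but the rows, group sizes, and times are invented for illustration.

```r
# Toy data only: division sizes, ages and times are invented.
set.seed(3)
division <- rep(c("M2024", "M2529", "W2529"), times = c(50, 200, 120))
toy <- data.frame(
  Division = division,
  Age = ifelse(division == "M2024", 20, 25) +
    sample(0:4, length(division), replace = TRUE),
  Time_in_seconds = round(rnorm(length(division), mean = 5400, sd = 800)),
  stringsAsFactors = FALSE
)

# The size of the smallest division sets the per-division sample size
n_min <- min(table(toy$Division))

# Draw n_min rows without replacement from every division, then recombine
balanced <- do.call(rbind, lapply(split(toy, toy$Division), function(d) {
  d[sample(nrow(d), n_min), ]
}))

table(balanced$Division)
fit_balanced <- lm(Time_in_seconds ~ Age, data = balanced)
```

Every division contributes the same number of rows to `balanced`, so large divisions no longer dominate the fitted line.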

library(dplyr)  # as_tibble(), mutate(), case_when(), group_by(), do(), %>%

# Define age groups
ageGroup <- c("0-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70-89+")

# Add a new column 'AgeGroup' based on the age ranges
Performance <- as_tibble(df) %>%
  mutate(AgeGroup = case_when(
    Age < 20 ~ ageGroup[1],
    Age >= 20 & Age < 30 ~ ageGroup[2],
    Age >= 30 & Age < 40 ~ ageGroup[3],
    Age >= 40 & Age < 50 ~ ageGroup[4],
    Age >= 50 & Age < 60 ~ ageGroup[5],
    Age >= 60 & Age < 70 ~ ageGroup[6],
    Age >= 70 ~ ageGroup[7]
  ))

# Split "hh:mm:ss" times into components and convert them to seconds
time_components <- strsplit(as.character(df$Time), ":")
df$Time_in_seconds <- sapply(time_components, function(x) {
  as.numeric(x[1]) * 3600 + as.numeric(x[2]) * 60 + as.numeric(x[3])
})

# Convert 'Time_in_seconds' back to "hh:mm" format for y-axis labels
df$Time_hh_mm <- sprintf("%02d:%02d", df$Time_in_seconds %/% 3600,
                         (df$Time_in_seconds %% 3600) %/% 60)

# Fit one linear model of time on age per division
division_models <- df %>%
  group_by(Division) %>%
  do(model_div = lm(Time_in_seconds ~ Age, data = .))

# View summaries of the models (one summary per division)
division_summaries <- lapply(division_models$model_div, summary)
# Define percentile labels
percentile <- c("10th", "20th", "30th", "40th", "50th", "60th", "70th", "80th", "90th", "100th")

# Within each year, compute each runner's empirical CDF value: the share of
# that year's finishers whose time is less than or equal to theirs.
# Time_in_seconds is used because ecdf() needs numeric input.
Performance <- as_tibble(df) %>%
  group_by(Year) %>%
  mutate(centile = ecdf(Time_in_seconds)(Time_in_seconds)) %>%
  ungroup() %>%
  # Bin the CDF values into ten labelled percentile bands
  mutate(Percentile = case_when(
    centile < 0.10 ~ percentile[1],
    centile >= 0.10 & centile < 0.20 ~ percentile[2],
    centile >= 0.20 & centile < 0.30 ~ percentile[3],
    centile >= 0.30 & centile < 0.40 ~ percentile[4],
    centile >= 0.40 & centile < 0.50 ~ percentile[5],
    centile >= 0.50 & centile < 0.60 ~ percentile[6],
    centile >= 0.60 & centile < 0.70 ~ percentile[7],
    centile >= 0.70 & centile < 0.80 ~ percentile[8],
    centile >= 0.80 & centile < 0.90 ~ percentile[9],
    centile >= 0.90 & centile <= 1.0 ~ percentile[10]
  )) %>%
  select(-centile)


library(ggplot2)
library(ggpmisc)  # stat_poly_eq(), use_label()

# Age goes on x and time on y so that the smoother can fit the formula y ~ x
ggplot(Performance, aes(x = Age, y = Time_in_seconds)) +
  geom_point(alpha = 0.1, color = "purple", size = 0.5) +
  labs(title = "Age vs Time scatterplots", y = "Time (hr:min:sec)") +
  geom_smooth(formula = y ~ x, method = lm, se = FALSE) +
  stat_poly_eq(use_label(c("eq"))) +
  stat_poly_eq(label.y = 0.9) +
  scale_y_continuous(labels = function(x) as.character(hms(seconds = x),
                                                       format = "%H:%M")) +
  theme_bw() +
  facet_wrap(factor(Sex) ~ factor(Division), nrow = 4)

# Define percentile labels
percentile <- c("10th", "20th", "30th", "40th", "50th", "60th", "70th", "80th", "90th", "100th")

# Within each sex, compute each runner's empirical CDF value: the share of
# same-sex finishers whose time is less than or equal to theirs.
# Time_in_seconds is used because ecdf() needs numeric input.
Performance <- as_tibble(df) %>%
  group_by(Sex) %>%
  mutate(centile = ecdf(Time_in_seconds)(Time_in_seconds)) %>%
  ungroup() %>%
  # Bin the CDF values into ten labelled percentile bands
  mutate(Percentile = case_when(
    centile < 0.10 ~ percentile[1],
    centile >= 0.10 & centile < 0.20 ~ percentile[2],
    centile >= 0.20 & centile < 0.30 ~ percentile[3],
    centile >= 0.30 & centile < 0.40 ~ percentile[4],
    centile >= 0.40 & centile < 0.50 ~ percentile[5],
    centile >= 0.50 & centile < 0.60 ~ percentile[6],
    centile >= 0.60 & centile < 0.70 ~ percentile[7],
    centile >= 0.70 & centile < 0.80 ~ percentile[8],
    centile >= 0.80 & centile < 0.90 ~ percentile[9],
    centile >= 0.90 & centile <= 1.0 ~ percentile[10]
  )) %>%
  select(-centile)



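# Fit one model of time on age for each sex and percentile combination
# (a second group_by() call would replace the first, so both go in one call)
centile_models <- Performance %>%
  group_by(Sex, Percentile) %>%
  do(model_centile = lm(Time_in_seconds ~ Age, data = .))

# Normal Q-Q plot of the residuals (which = 2) for every fitted model
for (i in seq_along(centile_models$model_centile)) {
  plot(centile_models$model_centile[[i]], which = 2)
}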

ggplot(Performance,aes(x=Age, y=Time_in_seconds))+
  geom_point(alpha= 0.1, color="purple", size = .5)+
  labs(title = "Age vs Time scatterplots", y="Time (hr:min:sec)")+
  geom_smooth(formula = y ~ x, method = lm, se = FALSE)+
   stat_poly_eq(use_label(c("eq"))) +
  stat_poly_eq(label.y = 0.9) +
  scale_y_continuous(labels = function(x) as.character(hms(seconds = x), 
  format = "%H:%M")) +
  theme_bw()+
  facet_wrap(factor(Sex)~factor(Percentile,levels=percentile), nrow = 2)

Concluding Thoughts

Based on the data visualization we have done, our next steps will focus on exploring the statistics by age group (divided by sex) and on the scatterplots. We plan to extract runners who ran several races and see whether their trend differs from that of individuals who only ran once (that is, whether times change differently with increasing age). We also want to bootstrap the different age groups to see whether there is truly a difference in running times between ages, and then try to model it again. This data visualization helped us see which areas we need to focus on for our final analysis.
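The bootstrap comparison mentioned above could look like the following sketch. It uses a percentile bootstrap on simulated times for two hypothetical age groups; the group means, sizes, and number of resamples are all invented, and our final analysis may use a different scheme.

```r
# Simulated data only: two age groups with invented mean times (in seconds).
set.seed(4)
young <- rnorm(300, mean = 5200, sd = 800)   # e.g. a 20-29 group
older <- rnorm(300, mean = 5800, sd = 800)   # e.g. a 50-59 group

# Resample each group with replacement and record the difference in means
boot_diff <- replicate(2000, {
  mean(sample(older, replace = TRUE)) - mean(sample(young, replace = TRUE))
})

# Percentile bootstrap 95% interval for the difference (older minus young)
ci <- quantile(boot_diff, c(0.025, 0.975))
ci
```

If the interval excludes zero, the resampled data supports a real difference in mean times between the two groups; with the race data, the same steps would be applied to each pair of age groups within a sex.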